Pengfei-xue.GitHub.io

programming your life!


  • vim note

    Fix garbled Chinese characters when opening files in vim on Linux

    Edit the ~/.vimrc file and add the following lines:

    set fileencodings=utf-8,ucs-bom,gb18030,gbk,gb2312,cp936
    set termencoding=utf-8
    set encoding=utf-8

  • spark note

    value toDF is not a member of org.apache.spark.rdd.RDD

    Import implicits: Note that this should be done only after an instance of org.apache.spark.sql.SQLContext is created. It should be written as:

    val sqlContext= new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    

    value toDF is not a member of org.apache.spark.rdd.RDD

    How to convert rdd object to dataframe in spark (scala)

    SqlContext has a number of createDataFrame methods that create a DataFrame given an RDD. I imagine one of these will work for your context.

    For example:

    def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame
    
    // you should import the implicits after creating a new SQLContext
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    rdd.toDF()
    
    StructType(Seq(
        StructField("request1", MapType(StringType, StringType, true), true),
        StructField("response1", StringType, true)
    ))
    

    How to convert rdd object to dataframe in spark (scala) Converting Map type in Case Class to StructField Type
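
    For comparison, a minimal PySpark sketch of the same approach; the pyspark.sql.types names mirror the Scala ones, and the sample record below is made up purely for illustration:

    from pyspark.sql.types import StructType, StructField, MapType, StringType

    # same shape as the Scala schema above: a map column plus a string column
    schema = StructType([
        StructField("request1", MapType(StringType(), StringType(), True), True),
        StructField("response1", StringType(), True),
    ])

    rdd = sc.parallelize([({"q": "ping"}, "pong")])   # illustrative data
    df = sqlContext.createDataFrame(rdd, schema)
    df.printSchema()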

    How to get hadoop put to create directories if they don’t exist

    hadoop fs -mkdir -p <path>
    

    Spark union of multiple RDDs

    from functools import reduce  # For Python 3.x
    from pyspark.sql import DataFrame
    
    def unionAll(*dfs):
        return reduce(DataFrame.unionAll, dfs)
    
    df1 = sqlContext.createDataFrame([(1, "foo1"), (2, "bar1")], ("k", "v"))
    df2 = sqlContext.createDataFrame([(3, "foo2"), (4, "bar2")], ("k", "v"))
    df3 = sqlContext.createDataFrame([(5, "foo3"), (6, "bar3")], ("k", "v"))
    
    unionAll(df1, df2, df3).show()
    

    Spark union of multiple RDDs

    Create new column with function in Spark Dataframe

    import org.apache.spark.sql.functions._
    val myDF = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
    val coder: (Int => String) = (arg: Int) => {if (arg < 100) "little" else "big"}
    val sqlfunc = udf(coder)
    myDF.withColumn("Code", sqlfunc(col("Amt")))
    

    Loading local (or HDFS) files in Spark, and using textFile on a SparkContext instance

    Many examples online, including the official ones, use textFile to load a file and create an RDD, along the lines of:

    sc.textFile("hdfs://n1:8020/user/hdfs/input")
    

    The argument to textFile is a path, which can be:

    1. A path to a single file, in which case only that file is loaded
    2. A path to a directory, in which case all files directly under that directory are loaded (files in subdirectories are not included)
    3. A glob pattern, which loads several files at once, or all files under several directories

    The third form is a handy trick. Suppose the data is partitioned first by day and then by hour, with an HDFS directory layout like:

    /user/hdfs/input/dt=20130728/hr=00/
    /user/hdfs/input/dt=20130728/hr=01/
    /user/hdfs/input/dt=20130728/hr=23/
    

    The actual data sits under the hr=<hour> directories. To analyze the whole day 20130728, we have to load the data from every hr=* subdirectory into the RDD, which we can do like this:

    sc.textFile("hdfs://n1:8020/user/hdfs/input/dt=20130728/hr=*/")

    Note the hr=* part: it is a wildcard (glob) match.
    

    http://blog.csdn.net/zy_zhengyang/article/details/46853441

    How to load IPython shell with PySpark

    PYSPARK_DRIVER_PYTHON=ipython /path/to/bin/pyspark
    

    How to load IPython shell with PySpark

    Why does Spark report “java.net.URISyntaxException: Relative path in absolute URI” when working with DataFrames?

    spark-shell --conf spark.sql.warehouse.dir=file:///c:/tmp/spark-warehouse
    
    PYSPARK_DRIVER_PYTHON=ipython pyspark --jars ../spark/jars/mysql-connector-java-5.1.40.jar --conf spark.sql.warehouse.dir=/data/warehouse.dir
    

    Why does Spark report “java.net.URISyntaxException: Relative path in absolute URI” when working with DataFrames?

    datetime range filter in PySpark SQL

    from pyspark.sql.functions import to_date, lit
    from pyspark.sql.types import TimestampType

    dates = ("2013-01-01", "2015-07-01")
    date_from, date_to = [to_date(lit(s)).cast(TimestampType()) for s in dates]

    sf.where((sf.my_col > date_from) & (sf.my_col < date_to))
    

    datetime range filter in PySpark SQL

    ‘0000-00-00 00:00:00’ can not be represented as java.sql.Timestamp error

    You can use this JDBC URL directly in your data source configuration:

    jdbc:mysql://yourserver:3306/yourdatabase?zeroDateTimeBehavior=convertToNull
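
    For example, a minimal PySpark sketch of passing this URL when reading over JDBC; the server, database, table, and credentials are placeholders:

    # assumes the MySQL JDBC driver is on the classpath, e.g. via --jars
    df = sqlContext.read.format("jdbc").options(
        url="jdbc:mysql://yourserver:3306/yourdatabase?zeroDateTimeBehavior=convertToNull",
        driver="com.mysql.jdbc.Driver",
        dbtable="yourtable",
        user="user",
        password="password"
    ).load()  # '0000-00-00 00:00:00' values now come back as NULL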
    

    Apache Spark: map vs mapPartitions?

    What’s the difference between an RDD’s map and mapPartitions method? The method map converts each element of the source RDD into a single element of the result RDD by applying a function. mapPartitions converts each partition of the source RDD into multiple elements of the result (possibly none).

    And does flatMap behave like map or like mapPartitions? Neither: flatMap works on a single element (like map) and produces multiple elements of the result (like mapPartitions).
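
    A quick PySpark sketch of the difference, assuming a SparkContext named sc as in the pyspark shell (the data is illustrative):

    rdd = sc.parallelize([1, 2, 3, 4], 2)  # two partitions: [1, 2] and [3, 4]

    # map: exactly one output element per input element
    rdd.map(lambda x: x * 10).collect()                # [10, 20, 30, 40]

    # mapPartitions: the function gets an iterator over a whole partition
    # and may return any number of elements (here, one sum per partition)
    rdd.mapPartitions(lambda it: [sum(it)]).collect()  # [3, 7]

    # flatMap: each input element maps to zero or more output elements
    rdd.flatMap(lambda x: [x, x]).collect()            # [1, 1, 2, 2, 3, 3, 4, 4]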

    Apache Spark: map vs mapPartitions?

    how to use SparkContext.textFile for local file system

    sc.textFile('file:///home/pengfeix/hydra/log')
    

    spark use local file

    pyspark connect mysql

    From PySpark, this works for me:

    dataframe_mysql = mySqlContext.read.format("jdbc").options(
        url="jdbc:mysql://localhost:3306/my_bd_name",
        driver = "com.mysql.jdbc.Driver",
        dbtable = "my_tablename",
        user="root",
        password="root"
    ).load()
    

    How to work with MySQL DB and Apache Spark?

    How do I add a new column to a Spark DataFrame (using PySpark)?

    from pyspark.sql.functions import lit
    
    df = sqlContext.createDataFrame(
        [(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))
    
    df_with_x4 = df.withColumn("x4", lit(0))
    df_with_x4.show()
    
    ## +---+---+-----+---+
    ## | x1| x2|   x3| x4|
    ## +---+---+-----+---+
    ## |  1|  a| 23.0|  0|
    ## |  3|  B|-23.0|  0|
    ## +---+---+-----+---+
    

    How do I add a new column to a Spark DataFrame (using PySpark)?

  • scala note

    Scala partial functions

    val fraction = new PartialFunction[Int, Int] {
      def apply(d: Int) = 42 / d
      def isDefinedAt(d: Int) = d != 0
    }

    // map applies the function to every element, so this throws scala.MatchError on "cat"
    List(41, "cat") map { case i: Int => i + 1 }

    Scala partial functions

    Speeding up sbt

    Add the following to the ~/.sbt/repositories file:

    [repositories]
        local
        aliyun: http://maven.aliyun.com/nexus/content/groups/public/
        central: http://repo1.maven.org/maven2/
    

    How to read Scala command line arguments

    println("Hello, " + args(0))
    
    >>> scala hello.scala Al
    >>> Hello, Al
    

    java properties file location

    The typical way of handling this is to load the base properties from your embedded file, and allow users of the application to specify an additional file with overrides. Some pseudocode:

    Properties p = new Properties();
    InputStream in = this.getClass().getResourceAsStream("c.properties");
    p.load(in);
    
    String externalFileName = System.getProperty("app.properties");
    InputStream fin = new FileInputStream(new File(externalFileName));
    p.load(fin);
    

    Your program would then be invoked like this:

    java -Dapp.properties="/path/to/custom/app.properties" -jar app.jar
    

    how to read properties file Java Properties file examples

    sbt deduplicate: different file contents found in the following

    % "provided"
    

    To exclude one of two duplicated jars, use the "provided" keyword. For example, Spark acts as a container: to write a Spark application we need the spark-core jar, but when the application is actually packaged and submitted to the cluster it does not need to be bundled into our jar, so we mark it with % "provided" to exclude it.

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "0.8.0-incubating" % "provided",
      "org.apache.hadoop" % "hadoop-client" % "2.0.0-cdh4.4.0" % "provided"
    )
    

    different file contents found in the following

    Include Dependencies in JAR using SBT package

    For sbt 0.13.6+ add sbt-assembly as a dependency in project/assembly.sbt:

    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
    

    For example, here’s a multi-project build.sbt:

    lazy val commonSettings = Seq(
      version := "0.1-SNAPSHOT",
      organization := "com.example",
      scalaVersion := "2.10.1",
      test in assembly := {}
    )
    
    lazy val app = (project in file("app")).
      settings(commonSettings: _*).
      settings(
        mainClass in assembly := Some("com.example.Main"),
        // more settings here ...
      )
    
    lazy val utils = (project in file("utils")).
      settings(commonSettings: _*).
      settings(
        assemblyJarName in assembly := "utils.jar",
        // more settings here ...
      )
    
      sbt clean assembly  // build the assembled (fat) jar
    

    sbt-assembly how do I get sbt to gather all the jar files my code depends on into one place?

  • python note

    Display an image from a file in an IPython Notebook

    from IPython.display import Image
    Image(filename='test.png')

    In Python, how do I determine if an object is iterable?

    try:
        some_object_iterator = iter(some_object)
    except TypeError, te:
        print some_object, 'is not iterable'

    try:
        _ = (e for e in my_object)
    except TypeError:
        print my_object, 'is not iterable'

    import collections

    if isinstance(e, collections.Iterable):
        pass  # e is iterable

    http://stackoverflow.com/questions/1952464/in-python-how-do-i-determine-if-an-object-is-iterable

    Best way to parse a URL query string

    from urlparse import urlparse, parse_qs

    URL = 'https://someurl.com/with/query_string?i=main&mode=front&sid=12ab&enc=+Hello'
    parsed_url = urlparse(URL)
    parse_qs(parsed_url.query)
    # {'i': ['main'], 'enc': [' Hello'], 'mode': ['front'], 'sid': ['12ab']}

    On Python 3, the same functions live in urllib.parse (the six module offers a compatibility layer).
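
    A quick sketch of the same parse on Python 3, using the same illustrative URL:

    from urllib.parse import urlparse, parse_qs

    url = 'https://someurl.com/with/query_string?i=main&mode=front&sid=12ab&enc=+Hello'
    parse_qs(urlparse(url).query)
    # {'i': ['main'], 'mode': ['front'], 'sid': ['12ab'], 'enc': [' Hello']}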

    string format

    content = u'''
        <table><tbody>
          <tr>
            <td> 响应时间平均值 </td>
            <td> {mean:.4f} </td>
          </tr>
          <tr>
            <td> 响应时间75% </td>
            <td> {seventy_five:.4f} </td>
          </tr>
          <tr>
            <td> 响应时间95% </td>
            <td> {ninety_five:.4f} </td>
          </tr>
          <tr>
            <td> 可用性 </td>
            <td> {ava} </td>
          </tr>
        </tbody></table>
    '''.format(**{
        'mean': df.mean()[0],
        'seventy_five': df.quantile(.75)[0],
        'ninety_five': df.quantile(.95)[0],
        'ava': ava
    })
    

    Using % and .format() for great good!

    How can I use Python to get the system hostname?

    import socket
    print(socket.gethostname())
    
    import platform
    platform.node()
    

    http://stackoverflow.com/questions/4271740/how-can-i-use-python-to-get-the-system-hostname

    Python: defaultdict of defaultdict?

    defaultdict(lambda : defaultdict(int))
    

    Python: defaultdict of defaultdict
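
    A quick usage sketch (the keys here are made up):

    from collections import defaultdict

    counts = defaultdict(lambda: defaultdict(int))
    counts['spark']['errors'] += 1   # missing keys are created on the fly
    counts['spark']['errors'] += 1
    counts['spark']['errors']        # 2
    counts['vim']['errors']          # 0, from the inner int() default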

    get random string

    Generating strings from (for example) lowercase characters:

    import random, string
    
    def randomword(length):
        return ''.join(random.choice(string.ascii_lowercase) for i in range(length))
    
    Results:
    >>> randomword(10)
    'vxnxikmhdc'
    >>> randomword(10)
    'ytqhdohksy'
    

    get random string

  • pandas note

    iter rows over dataframe

    >>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])
    >>> row = next(df.iterrows())[1]
    >>> row
    int      1.0
    float    1.5
    Name: 0, dtype: float64
    >>> print(row['int'].dtype)
    float64
    >>> print(df['int'].dtype)
    int64
    

    pandas.DataFrame.iterrows
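
    For reference, a typical loop over the same df; as shown above, iterrows yields each row as a Series, so the int column comes back coerced to float:

    for idx, row in df.iterrows():
        print(row['int'])   # 1.0, not 1 -- the row Series upcasts to float64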